02 | Supervised Learning

Max Pellert (https://mpellert.at)

Deep Learning for the Social Sciences

Github Classroom

From now on, please find all class materials in this repo: https://github.com/DLSS-24/DLSS-24

Supervised learning

Regression

Univariate regression problem (one output, real value)

Supervised learning overview

Supervised learning model = mapping from one or more inputs to one or more outputs

Computing the outputs from the inputs = inference

Example:

Input is age and mileage of secondhand Toyota Prius

Output is estimated price of car
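As a toy sketch of such a mapping (the linear form and all coefficient values are made up for illustration, not a fitted model):

```python
# Toy sketch of a supervised model: price = f(age, mileage; parameters).
# The linear form and the coefficient values are hypothetical.
def predict_price(age_years, mileage_km, phi0=25000.0, phi1=-1200.0, phi2=-0.05):
    """Map the two inputs to an estimated price (made-up parameters)."""
    return phi0 + phi1 * age_years + phi2 * mileage_km

print(predict_price(5.0, 60000.0))  # one input/output evaluation
```

Training would then mean finding values of phi0, phi1, phi2 that predict actual sale prices well.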

Supervised learning overview

Model is a mathematical equation, but better and more generally: model is a family of equations

Model includes parameters

Parameters affect outcome of equations

Training a model = finding parameters that predict outputs “well” from inputs for a training dataset of input/output pairs

Notation

Check Appendix A of “Understanding Deep Learning”

Variables (inputs and outputs) are always indicated with Roman letters
Normal = scalar
Bold = vector
CAPITAL BOLD = matrix

Notation

Model Functions are always indicated with square brackets
Normal = returns scalar
Bold = returns vector
CAPITAL BOLD = returns matrix

Notation examples

Input: x (structured or tabular data)
Output: y
Model: y = f[x, φ]
Parameters: φ — parameters are always Greek letters

Loss function

We use a training dataset of I pairs of input/output examples: {x_i, y_i} for i = 1, …, I

Loss function or cost function: measures how bad the model is at relating input to output for the examples; it depends on the parameters φ, the model f, and the training data {x_i, y_i}

Or short: L[φ], treating model and training data as fixed

Training

Loss function: returns a scalar that is smaller when model maps inputs to outputs better

During training, we try to find the parameters that minimize the loss: φ̂ = argmin_φ L[φ]
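A minimal sketch of this minimization for a 1D linear model, using plain gradient descent (the data, learning rate, and step count are made up for illustration; gradient descent itself is covered in a later session):

```python
import numpy as np

# Minimal sketch: minimize the least-squares loss of a 1D linear model
# y = phi0 + phi1 * x by plain gradient descent. Data, learning rate and
# step count are made up for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 0.1, 50)    # "true" parameters: 2 and 3

phi = np.zeros(2)                              # [phi0, phi1], start at zero
for _ in range(2000):
    residual = phi[0] + phi[1] * x - y         # prediction error per example
    grad = 2 * np.array([residual.sum(), (residual * x).sum()])
    phi -= 0.005 * grad                        # one gradient descent step

print(phi)                                     # close to [2, 3]
```

The loop repeatedly nudges the parameters downhill on the loss surface; with enough steps it recovers parameters close to those that generated the data.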

Testing

To test the model, we evaluate it on a separate test dataset of input/output pairs

Crucially, the model must not have seen this data during training (suspiciously high, near-perfect performance on the test set is an indicator of possible leakage between training and test data)

Testing allows us to see how well it generalizes to “new data”

Testing

Still, the test data is usually from the same domain and collected in the same way as the training data, so external validity can be low although test set performance is high

Always be critical and try to assess performance “in the wild” to establish the model’s limits

This is where your training in social-science thinking can come in very handy!

Example: 1D Linear regression model

Model: y = f[x, φ] = φ_0 + φ_1·x

Parameters: φ = (φ_0, φ_1), the y-intercept φ_0 and the slope φ_1

Loss function: least squares

Least squares loss function

Loss function: L[φ] = Σ_i (φ_0 + φ_1·x_i − y_i)², the sum of the squared deviations between model predictions and observed outputs

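A minimal sketch of this loss as a function (the data values are made up and chosen to lie exactly on a line):

```python
import numpy as np

# Minimal sketch: the least-squares loss for a 1D linear model,
# L[phi] = sum_i (phi0 + phi1 * x_i - y_i)^2. Data values are made up.
def least_squares_loss(phi, x, y):
    residuals = phi[0] + phi[1] * x - y   # model prediction minus target
    return float(np.sum(residuals ** 2))

x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 5.0])                  # lies exactly on y = 1 + 2x
print(least_squares_loss([1.0, 2.0], x, y))    # perfect fit -> 0.0
print(least_squares_loss([0.0, 0.0], x, y))    # 1 + 9 + 25 = 35.0
```

The loss is zero only when every prediction matches its target; any deviation is penalized quadratically.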

Possible objections?

But you can fit the line model in closed form!

True – but only because we are looking at very simple cases so far; we won’t be able to do this for more complex models

But we could exhaustively try every slope and intercept combo!

True – but we won’t be able to do this when there are a million parameters
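For the line model, the closed-form solution does exist: least squares reduces to solving the normal equations. A minimal sketch (the data values are made up):

```python
import numpy as np

# Minimal sketch: the closed-form least-squares solution for the line
# model, found by solving min ||A phi - y||^2 with a design matrix A
# whose columns are [1, x]. Data values are made up.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])          # roughly y = 1 + 2x

A = np.stack([np.ones_like(x), x], axis=1)  # design matrix [1, x]
phi, *_ = np.linalg.lstsq(A, y, rcond=None)
print(phi)                                   # intercept and slope
```

For deep networks, no such closed form exists, which is why iterative methods like gradient descent are needed.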

What do we aim for in the end?

We test with different set of paired input/output data to measure performance

Degree to which we get the same performance as in training = generalization

Might not generalize well because the model is too simple

Or because the model is too complex

It then fits the statistical peculiarities of the specific training data we used, not the “general characteristics” of the underlying relationship

This is known as overfitting
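A minimal sketch of overfitting with synthetic, made-up data: a high-degree polynomial drives the training error to (near) zero but does worse on held-out test points than its own training performance suggests:

```python
import numpy as np

# Minimal sketch of overfitting: with 10 training points, a degree-9
# polynomial (10 parameters) can fit the training data almost exactly,
# while a degree-1 line captures the underlying trend. All data are
# synthetic and made up for illustration.
rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.3, 10)   # noisy linear relationship
x_test = np.linspace(0.05, 0.95, 10)
y_test = 2 * x_test + rng.normal(0, 0.3, 10)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

simple = np.polyfit(x_train, y_train, 1)     # degree-1 model
complex_ = np.polyfit(x_train, y_train, 9)   # degree-9 model

print(mse(simple, x_train, y_train), mse(simple, x_test, y_test))
print(mse(complex_, x_train, y_train), mse(complex_, x_test, y_test))
```

The complex model wins on the training set but its test error is far above its (near-zero) training error — the generalization gap that testing is meant to reveal.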

Where is all of this going?

  • Shallow neural networks (a more flexible model)

  • Deep neural networks (an even more flexible model)

  • Loss functions (where did least squares come from?)

  • How to train neural networks (gradient descent and variants)

  • How to measure performance of neural networks (generalization)

Still for today: a practical outlook on Word2Vec

Word2Vec

A shallow neural net, which we will cover next time

Surprising relationships can be found in the vector space by computing similarities between word vectors

“[…] simple algebraic operations are performed on the word vectors, [and] it was shown for example that vector(‘King’) - vector(‘Man’) + vector(‘Woman’) results in a vector that is closest to the vector representation of the word ‘Queen’.”

Although the approach is dated by now, it has been used in a large number of different (published) studies in the social sciences in recent years (sometimes even today)
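The analogy arithmetic can be illustrated with hand-made toy vectors (real Word2Vec vectors have hundreds of dimensions and are learned from large corpora; the words, dimensions, and values below are entirely made up):

```python
import numpy as np

# Toy illustration of the "king - man + woman ≈ queen" analogy using
# hand-made 3-dimensional "word vectors". Dimensions are loosely meant
# as (royalty, male, female); all values are made up.
vectors = {
    "king":  np.array([0.9, 0.9, 0.1]),
    "queen": np.array([0.9, 0.1, 0.9]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "car":   np.array([0.0, 0.2, 0.2]),   # unrelated distractor word
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# As in the quote: exclude the query words, keep the nearest remaining word.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max((w for w in vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # "queen"
```

Libraries such as gensim wrap exactly this kind of nearest-neighbor query over learned vectors.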